|
Cross-validation, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of ''known data'' on which training is run (''training dataset''), and a dataset of ''unknown data'' (or ''first seen'' data) against which the model is tested (''testing dataset'').〔(【引用サイトリンク】title=Newbie question: Confused about train, validation and test data! )〕 The goal of cross validation is to define a dataset to "test" the model in the training phase (i.e., the ''validation dataset''), in order to limit problems like overfitting, give an insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem), etc. One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the ''training set''), and validating the analysis on the other subset (called the ''validation set'' or ''testing set''). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds. Cross-validation is important in guarding against testing hypotheses suggested by the data (called "Type III errors"), especially where further samples are hazardous, costly or impossible to collect. Furthermore, one of the main reasons for using cross-validation instead of using the conventional validation (e.g. partitioning the data set into two sets of 70% for training and 30% for test) is that the error (e.g. Root Mean Square Error) on the training set in the conventional validation is not a useful estimator of model performance and thus the error on the test data set does not properly represent the assessment of model performance. This may be because there is not enough data available or there is not a good distribution and spread of data to partition it into separate training and test sets in the conventional validation method. In these cases, a fair way to properly estimate model prediction performance is to use cross-validation as a powerful general technique. In summary, cross-validation combines (averages) measures of fit (prediction error) to correct for the optimistic nature of training error and derive a more accurate estimate of model prediction performance.〔 ==Purpose of cross-validation== Suppose we have a model with one or more unknown parameters, and a data set to which the model can be fit (the training data set). The fitting process optimizes the model parameters to make the model fit the training data as well as possible. If we then take an independent sample of validation data from the same population as the training data, it will generally turn out that the model does not fit the validation data as well as it fits the training data. This is called overfitting, and is particularly likely to happen when the size of the training data set is small, or when the number of parameters in the model is large. Cross-validation is a way to predict the fit of a model to a hypothetical validation set when an explicit validation set is not available. Linear regression provides a simple illustration of overfitting. In linear regression we have real ''response values'' ''y''1, ..., ''yn'', and ''n'' ''p''-dimensional vector ''covariates'' ''x''1, ..., ''xn''. The components of the vectors ''x''i are denoted ''x''i1, ..., ''x''ip. If we use least squares to fit a function in the form of a hyperplane ''y'' = ''a'' + ''β''T''x'' to the data (''x''i, ''y''i)1≤i≤n, we could then assess the fit using the mean squared error (MSE). The MSE for a given value of the parameters ''a'' and ''β'' on the training set (''x''i, ''y''i)1≤i≤n is : It can be shown under mild assumptions that the expected value of the MSE for the training set is (''n'' − ''p'' − 1)/(''n'' + ''p'' + 1) < 1 times the expected value of the MSE for the validation set (the expected value is taken over the distribution of training sets). Thus if we fit the model and compute the MSE on the training set, we will get an optimistically biased assessment of how well the model will fit an independent data set. This biased estimate is called the ''in-sample'' estimate of the fit, whereas the cross-validation estimate is an ''out-of-sample'' estimate. Since in linear regression it is possible to directly compute the factor (''n'' − ''p'' − 1)/(''n'' + ''p'' + 1) by which the training MSE underestimates the validation MSE, cross-validation is not practically useful in that setting (however, cross-validation remains useful in the context of linear regression in that it can be used to select an optimally regularized cost function). In most other regression procedures (e.g. logistic regression), there is no simple formula to make such an adjustment. Cross-validation is, thus, a generally applicable way to predict the performance of a model on a validation set using computation in place of mathematical analysis. 抄文引用元・出典: フリー百科事典『 ウィキペディア(Wikipedia)』 ■ウィキペディアで「Cross-validation (statistics)」の詳細全文を読む スポンサード リンク
|